Abstract

Syntactic natural language parsers have 
shown themselves to be inadequate for pro- 
cessing highly-ambiguous large-vocabulary 
text, as is evidenced by their poor per- 
formance on domains like the Wall Street 
Journal, and by the movement away 
from parsing-based approaches to text- 
processing in general. In this paper, I de- 
scribe SPATTER, a statistical parser based 
on decision-tree learning techniques which 
constructs a complete parse for every sen- 
tence and achieves accuracy rates far bet- 
ter than any published result. This work 
is based on the following premises: (1) 
grammars are too complex and detailed to 
develop manually for most interesting do- 
mains; (2) parsing models must rely heav- 
ily on lexical and contextual information 
to analyze sentences accurately; and (3) 
existing n-gram modeling techniques are 
inadequate for parsing models. In exper- 
iments comparing SPATTER with IBM's 
computer manuals parser, SPATTER sig- 
nificantly outperforms the grammar-based 
parser. Evaluating SPATTER against the 
Penn Treebank Wall Street Journal corpus 
using the PARSEVAL measures, SPAT- 
TER achieves 86% precision, 86% recall, 
and 1.3 crossing brackets per sentence for 
sentences of 40 words or less, and 91% pre- 
cision, 90% recall, and 0.5 crossing brackets 
for sentences between 10 and 20 words in 
length. 
1 Introduction

Parsing a natural language sentence can be viewed as 
making a sequence of disambiguation decisions: de- 
termining the part-of-speech of the words, choosing 
between possible constituent structures, and select- 
ing labels for the constituents. Traditionally, disam- 
biguation problems in parsing have been addressed 
by enumerating possibilities and explicitly declaring 
knowledge which might aid the disambiguation pro- 
cess. However, these approaches have proved too 
brittle for most interesting natural language prob- 
lems. 

This work addresses the problem of automatically 
discovering the disambiguation criteria for all of the 
decisions made during the parsing process, given the 
set of possible features which can act as disambigua- 
tors. The candidate disambiguators are the words in 
the sentence, relationships among the words, and re- 
lationships among constituents already constructed 
in the parsing process. 

Since most natural language rules are not abso- 
lute, the disambiguation criteria discovered in this 
work are never applied deterministically. Instead, all 
decisions are pursued non-deterministically accord- 
ing to the probability of each choice. These proba- 
bilities are estimated using statistical decision tree 
models. The probability of a complete parse tree 
(T) of a sentence (S) is the product of each decision 
(di) conditioned on all previous decisions: 

$$P(T \mid S) = \prod_{i} P(d_i \mid d_{i-1} d_{i-2} \ldots d_1 S).$$

Each decision sequence constructs a unique parse, 
and the parser selects the parse whose decision se- 
quence yields the highest cumulative probability. By 
combining a stack decoder search with a breadth- 
first algorithm with probabilistic pruning, it is pos- 
sible to identify the highest-probability parse for any 
sentence using a reasonable amount of memory and 
time. 
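
As a minimal sketch of how a parse is scored as a product of decision
probabilities, assuming those probabilities have already been produced
by some model, the score can be accumulated in log space to avoid
numerical underflow. The function names and all numbers below are
invented for illustration and are not taken from SPATTER.

```python
import math

def parse_log_prob(decision_probs):
    """Sum of log P(d_i | d_{i-1} ... d_1, S) over the decisions of one parse."""
    return sum(math.log(p) for p in decision_probs)

def best_parse(candidates):
    """Pick the candidate whose decision sequence has the highest probability.

    `candidates` maps a (hypothetical) parse name to its decision probabilities.
    """
    return max(candidates, key=lambda name: parse_log_prob(candidates[name]))

# Hypothetical decision probabilities for two competing parses of one sentence.
candidates = {
    "parse_a": [0.9, 0.8, 0.7],   # product = 0.504
    "parse_b": [0.95, 0.6, 0.8],  # product = 0.456
}
chosen = best_parse(candidates)
```

Here `chosen` is `"parse_a"`, the candidate with the larger product.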



The claim of this work is that statistics from 
a large corpus of parsed sentences combined with 
information-theoretic classification and training al- 
gorithms can produce an accurate natural language 
parser without the aid of a complicated knowl- 
edge base or grammar. This claim is justified by 
constructing a parser, called SPATTER (Statistical 
PATTErn Recognizer), based on very limited lin- 
guistic information, and comparing its performance 
to a state-of-the-art grammar-based parser on a 
common task. It remains to be shown that an accu- 
rate broad-coverage parser can improve the perfor- 
mance of a text processing application. This will be 
the subject of future experiments. 

One of the important points of this work is that 
statistical models of natural language should not 
be restricted to simple, context-insensitive models. 
In a problem like parsing, where long-distance lex- 
ical information is crucial to disambiguate inter- 
pretations accurately, local models like probabilistic 
context-free grammars are inadequate. This work 
illustrates that existing decision-tree technology can 
be used to construct and estimate models which se- 
lectively choose elements of the context which con- 
tribute to disambiguation decisions, and which have 
few enough parameters to be trained using existing 
resources. 

I begin by describing decision-tree modeling, 
showing that decision-tree models are equivalent to 
interpolated n-gram models. Then I briefly describe 
the training and parsing procedures used in SPAT- 
TER. Finally, I present some results of experiments 
comparing SPATTER with a grammarian's rule- 
based statistical parser, along with more recent re- 
sults showing SPATTER applied to the Wall Street 
Journal domain. 

2 Decision-Tree Modeling

Much of the work in this paper depends on replac- 
ing human decision-making skills with automatic 
decision-making algorithms. The decisions under 
consideration involve identifying constituents and 
constituent labels in natural language sentences. 
Grammarians, the human decision-makers in pars- 
ing, solve this problem by enumerating the features 
of a sentence which affect the disambiguation deci- 
sions and indicating which parse to select based on 
the feature values. The grammarian is accomplish- 
ing two critical tasks: identifying the features which 
are relevant to each decision, and deciding which 
choice to select based on the values of the relevant 
features. 

Decision-tree classification algorithms account for both of these
tasks, and they also accomplish a third task which grammarians
classically find difficult. By assigning a probability distribution to
the possible choices, decision trees provide a ranking system which not
only specifies the order of preference for the possible choices, but
also gives a measure of the relative likelihood that each choice is the
one which should be selected.

2.1 What is a Decision Tree? 

A decision tree is a decision-making device which assigns a probability
to each of the possible choices based on the context of the decision:
$P(f \mid h)$, where $f$ is an element of the future vocabulary (the
set of choices) and $h$ is a history (the context of the decision).
This probability $P(f \mid h)$ is determined by asking a sequence of
questions $q_1 q_2 \ldots q_n$ about the context, where the $i$th
question asked is uniquely determined by the answers to the $i - 1$
previous questions.

For instance, consider the part-of-speech tagging 
problem. The first question a decision tree might 
ask is: 

1. What is the word being tagged? 

If the answer is the, then the decision tree needs to ask no more
questions; it is clear that the decision tree should assign the tag
$f$ = determiner with probability 1. If, instead, the answer to
question 1 is bear, the decision tree might next ask the question:

2. What is the tag of the previous word?

If the answer to question 2 is determiner, the decision tree might stop
asking questions and assign the tag $f$ = noun with very high
probability, and the tag $f$ = verb with much lower probability.
However, if the answer to question 2 is noun, the decision tree would
need to ask still more questions to get a good estimate of the
probability of the tagging decision. The decision tree described in
this paragraph is shown in Figure 1.

Each question asked by the decision tree is repre- 
sented by a tree node (an oval in the figure) and the 
possible answers to this question are associated with 
branches emanating from the node. Each node de- 
fines a probability distribution on the space of pos- 
sible decisions. A node at which the decision tree 
stops asking questions is a leaf node. The leaf nodes 
represent the unique states in the decision-making 
problem, i.e. all contexts which lead to the same 
leaf node have the same probability distribution for 
the decision. 
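
The tagging example can be sketched as a small data structure. This is
an illustrative toy, not SPATTER's implementation: the `Node` class and
`predict` function are hypothetical names, and only the branch shown in
Figure 1 (leaf probabilities 0.8 and 0.2) is grown.

```python
class Node:
    """A decision-tree node: either asks a question or holds a leaf distribution."""
    def __init__(self, question=None, children=None, dist=None):
        self.question = question        # function: context -> answer (None at a leaf)
        self.children = children or {}  # answer -> child Node
        self.dist = dist                # leaf distribution: tag -> probability

def predict(node, context):
    """Follow the answers down the tree and return the leaf distribution P(f | h)."""
    while node.question is not None:
        node = node.children[node.question(context)]
    return node.dist

# Question 1: the word being tagged; question 2: the tag of the previous word.
word = lambda ctx: ctx["word"]
prev_tag = lambda ctx: ctx["prev_tag"]

tree = Node(word, {
    "the": Node(dist={"determiner": 1.0}),
    "bear": Node(prev_tag, {
        "determiner": Node(dist={"noun": 0.8, "verb": 0.2}),
    }),
})

dist = predict(tree, {"word": "bear", "prev_tag": "determiner"})
```

Because the tree is only partially grown, a context that reaches an
ungrown branch (for example, bear preceded by a noun) raises a
`KeyError` in this sketch; a full tree would keep asking questions
there.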

2.2 Decision Trees vs. n-grams 

A decision-tree model is not really very different from an interpolated
n-gram model. In fact, they are equivalent in representational power.
The main differences between the two modeling techniques are how the
models are parameterized and how the parameters are estimated.

Figure 1: Partially-grown decision tree for part-of-speech tagging. The
leaf shown assigns P(noun | bear, determiner) = 0.8 and
P(verb | bear, determiner) = 0.2.

2.2.1 Model Parameterization 

First, let's be very clear on what we mean by an n-gram model. Usually,
an n-gram model refers to a Markov process where the probability of a
particular token being generated is dependent on the values of the
previous $n - 1$ tokens generated by the same process. By this
definition, an n-gram model has $|W|^n$ parameters, where $|W|$ is the
number of unique tokens generated by the process.

However, here let's define an n-gram model more loosely as a model
which defines a probability distribution on a random variable given the
values of $n - 1$ random variables, $P(f \mid h_1 h_2 \ldots h_{n-1})$.
There is no assumption in the definition that any of the random
variables $F$ or $H_i$ range over the same vocabulary. The number of
parameters in this n-gram model is $|F| \prod_i |H_i|$.

Using this definition, an n-gram model can be represented by a
decision-tree model with $n - 1$ questions. For instance, the
part-of-speech tagging model $P(t_i \mid w_i t_{i-1} t_{i-2})$ can be
interpreted as a 4-gram model, where $H_1$ is the variable denoting the
word being tagged, $H_2$ is the variable denoting the tag of the
previous word, and $H_3$ is the variable denoting the tag of the word
two words back. Hence, this 4-gram tagging model is the same as a
decision-tree model which always asks the sequence of 3 questions:

1. What is the word being tagged? 

2. What is the tag of the previous word? 

3. What is the tag of the word two words back? 
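
To give a feel for the size of such a model, the sketch below counts
one parameter per combination of future and history values, following
the same counting convention as the $|W|^n$ figure above. The tag-set
size of 46 is the Penn Treebank figure quoted in Section 4.2; the
vocabulary size is a hypothetical round number chosen only for
illustration.

```python
from math import prod

def ngram_param_count(future_size, history_sizes):
    """|F| times the product of |H_i|: one probability per (history, future) pair."""
    return future_size * prod(history_sizes)

num_tags = 46        # Penn Treebank tag set size, as reported in Section 4.2
vocab_size = 10_000  # hypothetical word vocabulary size

# P(t_i | w_i, t_{i-1}, t_{i-2}): the future is a tag; the histories are a
# word and two tags.
params = ngram_param_count(num_tags, [vocab_size, num_tags, num_tags])
```

Under these assumptions `params` is 973,360,000, far more than could be
estimated reliably from a treebank of tens of thousands of sentences.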



But can a decision-tree model be represented by 
an n-gram model? No, but it can be represented 
by an interpolated n-gram model. The proof of this 
assertion is given in the next section. 

2.2.2 Model Estimation 

The standard approach to estimating an n-gram model is a two-step
process. The first step is to count the number of occurrences of each
n-gram from a training corpus. This process determines the empirical
distribution,

$$\hat{P}(f \mid h_1 h_2 \ldots h_{n-1}) = \frac{\mathrm{Count}(h_1 h_2 \ldots h_{n-1} f)}{\mathrm{Count}(h_1 h_2 \ldots h_{n-1})}.$$

The second step is smoothing the empirical distribution using a
separate, held-out corpus. This step improves the empirical
distribution by finding statistically unreliable parameter estimates
and adjusting them based on more reliable information.
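
The counting step can be sketched in a few lines. This is a minimal
illustration under the assumption that training events are available as
(history, future) pairs; the function name and the events themselves
are hypothetical.

```python
from collections import Counter

def empirical_model(events):
    """Relative-frequency estimate Count(h f) / Count(h) from (history, future) pairs."""
    joint = Counter(events)
    hist = Counter(h for h, _ in events)
    return {(h, f): c / hist[h] for (h, f), c in joint.items()}

# Hypothetical tagging events: history = (word, previous tag), future = tag.
events = [
    (("bear", "determiner"), "noun"),
    (("bear", "determiner"), "noun"),
    (("bear", "determiner"), "noun"),
    (("bear", "determiner"), "verb"),
    (("bear", "noun"), "verb"),
]
p_hat = empirical_model(events)
```

Any (history, future) pair that never occurs in training is simply
absent from `p_hat`, that is, it receives probability zero, which is
why the smoothing step is needed.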

A commonly-used technique for smoothing is deleted interpolation.
Deleted interpolation estimates a model
$\tilde{P}(f \mid h_1 h_2 \ldots h_{n-1})$ by using a linear
combination of empirical models
$\hat{P}(f \mid h_{k_1} h_{k_2} \ldots h_{k_m})$, where $m < n$ and
$k_{i-1} < k_i < n$ for all $i < m$. For example, a model
$\tilde{P}(f \mid h_1 h_2 h_3)$ might be interpolated as follows:

$$\begin{aligned}
\tilde{P}(f \mid h_1 h_2 h_3) ={} & \lambda_1(h_1 h_2 h_3)\,\hat{P}(f \mid h_1 h_2 h_3) \\
& + \lambda_2(h_1 h_2 h_3)\,\hat{P}(f \mid h_1 h_2) + \lambda_3(h_1 h_2 h_3)\,\hat{P}(f \mid h_1 h_3) \\
& + \lambda_4(h_1 h_2 h_3)\,\hat{P}(f \mid h_2 h_3) + \lambda_5(h_1 h_2 h_3)\,\hat{P}(f \mid h_1 h_2) \\
& + \lambda_6(h_1 h_2 h_3)\,\hat{P}(f \mid h_1) + \lambda_7(h_1 h_2 h_3)\,\hat{P}(f \mid h_2) \\
& + \lambda_8(h_1 h_2 h_3)\,\hat{P}(f \mid h_3)
\end{aligned}$$

where $\sum_i \lambda_i(h_1 h_2 h_3) = 1$ for all histories
$h_1 h_2 h_3$. The optimal values for the $\lambda_i$ functions can be
estimated using the forward-backward algorithm (Baum, 1972).
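
A linear interpolation of this kind can be sketched as follows. Two
simplifications are assumed and should be read as such: the weights are
fixed constants rather than functions of the history, and they are set
by hand rather than estimated with the forward-backward algorithm. All
names and numbers are hypothetical.

```python
def interpolate(f, history, models, weights):
    """Linear combination of empirical models over sub-histories.

    models:  maps a tuple of history positions, e.g. (0, 1), to a dict
             {(sub_history, future): probability}
    weights: maps the same position tuples to lambda values that sum to 1
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    total = 0.0
    for positions, lam in weights.items():
        sub_history = tuple(history[k] for k in positions)
        # An unseen (sub_history, future) pair contributes probability zero.
        total += lam * models[positions].get((sub_history, f), 0.0)
    return total

# Hypothetical empirical models for histories (h1, h2); numbers are invented.
models = {
    (0, 1): {(("bear", "determiner"), "noun"): 0.75},
    (0,):   {(("bear",), "noun"): 0.6},
    (1,):   {(("determiner",), "noun"): 0.9},
}
weights = {(0, 1): 0.5, (0,): 0.2, (1,): 0.3}
p = interpolate("noun", ("bear", "determiner"), models, weights)
```

For a history whose full context was never seen, such as
`("bear", "adjective")`, the lower-order term for the word alone still
contributes, so the smoothed estimate is non-zero.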

A decision-tree model can be represented by an interpolated n-gram
model as follows. A leaf node in a decision tree can be represented by
the sequence of question answers, or history values, which leads the
decision tree to that leaf. Thus, a leaf node defines a probability
distribution based on values of those questions:
$P(f \mid h_{k_1} h_{k_2} \ldots h_{k_m})$, where $m < n$ and
$k_{i-1} < k_i < n$, and where $h_{k_i}$ is the answer to one of the
questions asked on the path from the root to the leaf.[1] But this is
the same as one of the terms in the interpolated n-gram model. So, a
decision tree can be defined as an interpolated n-gram model where the
$\lambda_i$ function is defined as:

$$\lambda_i(h_{k_1} h_{k_2} \ldots h_{k_m}) = \begin{cases} 1 & \text{if } h_{k_1} h_{k_2} \ldots h_{k_m} \text{ is a leaf,} \\ 0 & \text{otherwise.} \end{cases}$$

[1] Note that in a decision tree, the leaf distribution is not affected
by the order in which questions are asked. Asking about $h_1$ followed
by $h_2$ yields the same future distribution as asking about $h_2$
followed by $h_1$.



2.3 Decision-Tree Algorithms

The point of showing the equivalence between n- 
gram models and decision-tree models is to make 
clear that the power of decision-tree models is not 
in their expressiveness, but instead in how they can 
be automatically acquired for very large modeling 
problems. As n grows, the parameter space for an 
n-gram model grows exponentially and it quickly 
becomes computationally infeasible to estimate the 
smoothed model using deleted interpolation. Also, 
as n grows large, the likelihood that the deleted in- 
terpolation process will converge to an optimal or 
even near-optimal parameter setting becomes van- 
ishingly small. 

On the other hand, the decision-tree learning al- 
gorithm increases the size of a model only as the 
training data allows. Thus, it can consider very large 
history spaces, i.e. n-gram models with very large n. 
Regardless of the value of n, the number of param- 
eters in the resulting model will remain relatively 
constant, depending mostly on the number of train- 
ing examples. 

The leaf distributions in decision trees are empirical estimates, i.e.
relative-frequency counts from the training data. Unfortunately, they
assign probability zero to events which can possibly occur. Therefore,
just as it is necessary to smooth empirical n-gram models, it is also
necessary to smooth empirical decision-tree models.

The decision-tree learning algorithms used in this work were developed
over the past 15 years by the IBM Speech Recognition group (Bahl et
al., 1989). The growing algorithm is an adaptation of the CART
algorithm in (Breiman et al., 1984). For detailed descriptions and
discussions of the decision-tree algorithms used in this work, see
(Magerman, 1994).



An important point which has been omitted from this discussion of
decision trees is the fact that only binary questions are used in these
decision trees. A question which has k values is decomposed into a
sequence of binary questions using a classification tree on those k
values. For example, a question about a word is represented as 30
binary questions. These 30 questions are determined by growing a
classification tree on the word vocabulary as described in (Brown et
al., 1992). The 30 questions represent 30 different binary partitions
of the word vocabulary, and these questions are defined such that it is
possible to identify each word by asking all 30 questions. For more
discussion of the use of binary decision-tree questions, see
(Magerman, 1994).
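
One way to picture the binary decomposition is to give every word a
fixed-length bit string and let question i ask about bit i. The sketch
below uses invented 3-bit codes in place of the 30 questions described
above; the codes, the vocabulary, and the function names are all
hypothetical, and the real partitions come from the classification tree
grown on the vocabulary.

```python
# Hypothetical 3-bit codes standing in for the 30 binary questions.
WORD_BITS = {
    "the":  "000",
    "a":    "001",
    "bear": "100",
    "cow":  "101",
    "is":   "110",
}

def ask(word, i):
    """Binary question i: is bit i of the word's code equal to 1?"""
    return WORD_BITS[word][i] == "1"

def partition(i):
    """The binary partition of the vocabulary induced by question i."""
    yes = {w for w in WORD_BITS if ask(w, i)}
    return yes, set(WORD_BITS) - yes

def identify(answers):
    """Recover the word from the answers to all questions, if one matches."""
    code = "".join("1" if a else "0" for a in answers)
    matches = [w for w, bits in WORD_BITS.items() if bits == code]
    return matches[0] if matches else None

answers = [ask("cow", i) for i in range(3)]
```

Each question splits the vocabulary in two, and the full set of answers
picks out a single word, so `identify(answers)` returns `"cow"`.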



3 SPATTER Parsing 

The SPATTER parsing algorithm is based on interpreting parsing as a
statistical pattern recognition process. A parse tree for a sentence is
constructed by starting with the sentence's words as leaves of a tree
structure, and labeling and extending these nodes until a
single-rooted, labeled tree is constructed. This pattern recognition
process is driven by the decision-tree models described in the previous
section.

3.1 SPATTER Representation 

A parse tree can be viewed as an n-ary branching 
tree, with each node in a tree labeled by either a 
non-terminal label or a part-of-speech label. If a 
parse tree is interpreted as a geometric pattern, a 
constituent is no more than a set of edges which 
meet at the same tree node. For instance, the noun
phrase, "a brown cow," consists of an edge extending
to the right from "a," an edge extending to the left
from "cow," and an edge extending straight up from
"brown."

Figure 2: Representation of constituent and labeling
of extensions in SPATTER.

In SPATTER, a parse tree is encoded in terms 
of four elementary components, or features: words, 
tags, labels, and extensions. Each feature has a fixed 
vocabulary, with each element of a given feature vo- 
cabulary having a unique representation. The word 
feature can take on any value of any word. The tag 
feature can take on any value in the part-of-speech 
tag set. The label feature can take on any value in 
the non-terminal set. The extension can take on any 
of the following five values: 

right - the node is the first child of a constituent; 

left - the node is the last child of a constituent; 

up - the node is neither the first nor the last child 
of a constituent; 

unary - the node is a child of a unary constituent; 



root - the node is the root of the tree. 
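
The five extension values can be read off a tree mechanically. The
sketch below assumes a simple nested-tuple encoding of a parse tree
(a hypothetical choice made for this example, not SPATTER's internal
representation) and labels each node with its extension.

```python
def extensions(tree):
    """Return (node name, extension) pairs for every node of a tree.

    A tree is either a word (str) or a tuple (label, child, child, ...).
    """
    out = []

    def name(t):
        return t if isinstance(t, str) else t[0]

    def visit(t, ext):
        out.append((name(t), ext))
        if isinstance(t, str):
            return
        children = t[1:]
        for k, child in enumerate(children):
            if len(children) == 1:
                child_ext = "unary"
            elif k == 0:
                child_ext = "right"   # first child: edge extends to the right
            elif k == len(children) - 1:
                child_ext = "left"    # last child: edge extends to the left
            else:
                child_ext = "up"      # middle child: edge extends straight up
            visit(child, child_ext)

    visit(tree, "root")
    return out

result = extensions(("N", "a", "brown", "cow"))
```

For the noun phrase "a brown cow" this yields right for "a", up for
"brown", and left for "cow", with the N node itself marked as root
because it is the top of this small tree.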

For an n word sentence, a parse tree has n leaf 
nodes, where the word feature value of the ith leaf 
node is the ith word in the sentence. The word fea- 
ture value of the internal nodes is intended to con- 
tain the lexical head of the node's constituent. A 
deterministic lookup table based on the label of the 
internal node and the labels of the children is used 
to approximate this linguistic notion. 

The SPATTER representation of the sentence

(S (N Each_DD1 code_NN1
      (Tn used_VVN
          (P by_II (N the_AT PC_NN1))))
   (V is_VBZ listed_VVN))

is shown in Figure 3. The nodes are constructed bottom-up from
left-to-right, with the constraint that no constituent node is
constructed until all of its children have been constructed. The order
in which the nodes of the example sentence are constructed is indicated
in the figure.




Figure 3: Treebank analysis encoded using feature 
values. 



3.2 Training SPATTER's models

SPATTER consists of three main decision-tree 
models: a part-of-speech tagging model, a node- 
extension model, and a node-labeling model. 



Each of these decision-tree models is grown using the following
questions, where X is one of word, tag, label, or extension, and Y is
either left or right:

• What is the X at the current node?

• What is the X at the node to the Y?

• What is the X at the node two nodes to the Y?

• What is the X at the current node's first child from the Y?

• What is the X at the current node's second child from the Y?

For each of the nodes listed above, the decision tree 
could also ask about the number of children and span 
of the node. For the tagging model, the values of the 
previous two words and their tags are also asked, 
since they might differ from the head words of the 
previous two constituents. 

The training algorithm proceeds as follows. The training corpus is
divided into two sets, approximately 90% for tree growing and 10% for
tree smoothing. For each parsed sentence in the tree growing corpus,
the correct state sequence is traversed. Each state transition from
$s_i$ to $s_{i+1}$ is an event; the history is made up of the answers
to all of the questions at state $s_i$ and the future is the value of
the action taken from state $s_i$ to state $s_{i+1}$. Each event is
used as a training example for the decision-tree growing process for
the appropriate feature's tree (e.g. each tagging event is used for
growing the tagging tree, etc.). After the decision trees are grown,
they are smoothed on the tree smoothing corpus using a variation of the
deleted interpolation algorithm described in (Magerman, 1994).



3.3 Parsing with SPATTER 

The parsing procedure is a search for the highest 
probability parse tree. The probability of a parse 
is just the product of the probability of each of the 
actions made in constructing the parse, according to 
the decision-tree models. 

Because of the size of the search space (roughly $O(|T|^n |N|^n)$,
where $|T|$ is the number of part-of-speech tags, $n$ is the number of
words in the sentence, and $|N|$ is the number of non-terminal labels),
it is not possible to compute the probability of every parse. However,
the specific search algorithm used is not very important, so long as
there are no search errors. A search error occurs when the highest
probability parse found by the parser is not the highest probability
parse in the space of all parses.
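
A quick calculation shows why exhaustive enumeration is out of the
question. The sketch below plugs in the Penn Treebank sizes reported in
Section 4.2 (46 tags, 27 non-terminal labels); the sentence length of
20 words is an arbitrary example, and the function name is
hypothetical.

```python
import math

def log10_search_space(num_tags, num_labels, n):
    """log10 of |T|^n * |N|^n, the rough size of the search space for n words."""
    return n * (math.log10(num_tags) + math.log10(num_labels))

# Penn Treebank sizes from Section 4.2; n = 20 is an arbitrary example length.
magnitude = log10_search_space(46, 27, 20)
```

Under this rough bound, a 20-word sentence has on the order of 10^61
candidate analyses.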

SPATTER's search procedure uses a two-phase approach to identify the
highest probability parse of a sentence. First, the parser uses a stack
decoding algorithm to quickly find a complete parse for the sentence.
Once the stack decoder has found a complete parse of reasonable
probability ($> 10^{-5}$), it switches to a breadth-first mode to
pursue all of the partial parses which have not been explored by the
stack decoder. In this second mode, it can safely discard any partial
parse which has a probability lower than the probability of the highest
probability completed parse. Using these two search modes, SPATTER
guarantees that it will find the highest probability parse. The only
limitation of this search technique is that, for sentences which are
modeled poorly, the search might exhaust the available memory before
completing both phases. However, these search errors conveniently occur
on sentences which SPATTER is likely to get wrong anyway, so there
isn't much performance lost due to the search errors. Experimentally,
the search algorithm guarantees the highest probability parse is found
for over 96% of the sentences parsed.
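
The pruning rule of the second phase can be illustrated on a toy
problem. This sketch is not SPATTER's search: the first phase is a
simple greedy dive standing in for the stack decoder, the 10^-5
threshold is omitted, and the `expand` and `is_complete` callbacks and
the decision table are hypothetical.

```python
from collections import deque

def two_phase_search(expand, is_complete):
    """Toy two-phase search over decision sequences.

    expand(decisions)      -> list of (next_decision, probability) pairs
    is_complete(decisions) -> True when the decision tuple is a full parse
    """
    # Phase 1 (simplified stand-in for the stack decoder): follow the most
    # probable extension at each step and set the alternatives aside.
    pending = deque()
    p, decisions = 1.0, ()
    while not is_complete(decisions):
        options = sorted(expand(decisions), key=lambda o: o[1], reverse=True)
        for d, q in options[1:]:
            pending.append((p * q, decisions + (d,)))
        d, q = options[0]
        p, decisions = p * q, decisions + (d,)
    best_p, best = p, decisions

    # Phase 2: breadth-first over the unexplored partial parses. A partial
    # parse whose probability is not above the best complete parse is
    # discarded, because further decisions only multiply it by values <= 1.
    while pending:
        p, decisions = pending.popleft()
        if p <= best_p:
            continue
        if is_complete(decisions):
            best_p, best = p, decisions
            continue
        for d, q in expand(decisions):
            pending.append((p * q, decisions + (d,)))
    return best, best_p

# Hypothetical two-step decision problem; the greedy first choice is not optimal.
TABLE = {
    (): [("x", 0.6), ("y", 0.4)],
    ("x",): [("a", 0.5), ("b", 0.5)],
    ("y",): [("a", 0.9), ("b", 0.1)],
}
best, best_p = two_phase_search(lambda d: TABLE[d], lambda d: len(d) == 2)
```

In this example the greedy dive first finds the parse ("x", "a") with
probability 0.3; the second phase then recovers ("y", "a") with
probability 0.36 and prunes the remaining partial parses.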

4 Experiment Results 

In the absence of an NL system, SPATTER can be 
evaluated by comparing its top-ranking parse with 
the treebank analysis for each test sentence. The 
parser was applied to two different domains, IBM 
Computer Manuals and the Wall Street Journal. 

4.1 IBM Computer Manuals 

The first experiment uses the IBM Computer Manuals domain, which
consists of sentences extracted from IBM computer manuals. The training
and test sentences were annotated by the University of Lancaster. The
Lancaster treebank uses 195 part-of-speech tags and 19 non-terminal
labels. This treebank is described in great detail in (Black et al.,
1991).



The main reason for applying SPATTER to this domain is that IBM had
spent the previous ten years developing a rule-based, unification-style
probabilistic context-free grammar for parsing this domain. The purpose
of the experiment was to estimate SPATTER's ability to learn the syntax
for this domain directly from a treebank, instead of depending on the
interpretive expertise of a grammarian.

The parser was trained on the first 30,800 sentences from the Lancaster
treebank. The test set included 1,473 new sentences, whose lengths
range from 3 to 30 words, with a mean length of 13.7 words. These
sentences are the same test sentences used in the experiments reported
for IBM's parser in (Black et al., 1993). In (Black et al., 1993),
IBM's parser was evaluated using the 0-crossing-brackets measure, which
represents the percentage of sentences for which none of the
constituents in the parser's parse violates the constituent boundaries
of any constituent in the correct parse. After over ten years of
grammar development, the IBM parser achieved a 0-crossing-brackets
score of 69%. On this same test set, SPATTER scored 76%.

4.2 Wall Street Journal 

The second experiment is intended to illustrate SPATTER's ability to
accurately parse a highly-ambiguous, large-vocabulary domain. These
experiments use the Wall Street Journal domain, as annotated in the
Penn Treebank, version 2. The Penn Treebank uses 46 part-of-speech tags
and 27 non-terminal labels.[2]

The WSJ portion of the Penn Treebank is divided into 25 sections,
numbered 00-24. In these experiments, SPATTER was trained on sections
02-21, which contain approximately 40,000 sentences. The test results
reported here are from section 00, which contains 1920 sentences.[3]
Sections 01, 22, 23, and 24 will be used as test data in future
experiments.

The Penn Treebank is already tokenized and split into sentences by
human annotators, and thus the test results reported here reflect this.
SPATTER parses word sequences, not tag sequences. Furthermore, SPATTER
does not simply pre-tag the sentences and use only the best tag
sequence in parsing. Instead, it uses a probabilistic model to assign
tags to the words, and considers all possible tag sequences according
to the probability they are assigned by the model. No information about
the legal tags for a word is extracted from the test corpus. In fact,
no information other than the words is used from the test corpus.

For the sake of efficiency, only the sentences of 40 words or fewer are
included in these experiments.[4] For this test set, SPATTER takes on
average 12 seconds per sentence on an SGI R4400 with 160 megabytes of
RAM.

[2] This treebank also contains coreference information,
predicate-argument relations, and trace information indicating
movement; however, none of this additional information was used in
these parsing experiments.

[3] For an independent research project on coreference, sections 00 and
01 have been annotated with detailed coreference information. A portion
of these sections is being used as a development test set. Training
SPATTER on them would improve parsing accuracy significantly and skew
these experiments in favor of parsing-based approaches to coreference.
Thus, these two sections have been excluded from the training set and
reserved as test sentences.

[4] SPATTER returns a complete parse for all sentences of fewer than 50
words in the test set, but the sentences of 41-50 words required much
more computation than the shorter sentences, and so they have been
excluded.

To evaluate SPATTER's performance on this domain, I am using the
PARSEVAL measures, as defined in (Black et al., 1991):

Precision

$$\frac{\text{no. of correct constituents in SPATTER parse}}{\text{no. of constituents in SPATTER parse}}$$

Recall

$$\frac{\text{no. of correct constituents in SPATTER parse}}{\text{no. of constituents in treebank parse}}$$

Crossing Brackets: no. of constituents which violate constituent
boundaries with a constituent in the treebank parse.
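
These three measures can be sketched for a single sentence as follows.
The sketch assumes each constituent is represented only by its word
span, as a (start, end) pair with the end exclusive, and that the spans
of a parse form a set (so duplicate spans from unary chains collapse).
The function names and the example spans are hypothetical.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return (i < k < j < l) or (k < i < l < j)

def parseval(parse, gold):
    """Unlabelled precision, recall, and crossing-bracket count.

    parse, gold: sets of (start, end) word spans, one per constituent.
    """
    correct = len(parse & gold)
    precision = correct / len(parse)
    recall = correct / len(gold)
    # A parser constituent counts once if it crosses any treebank constituent.
    crossing = sum(any(crosses(c, g) for g in gold) for c in parse)
    return precision, recall, crossing

# Hypothetical spans over a five-word sentence.
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
parse = {(0, 5), (0, 2), (1, 4)}
precision, recall, crossing = parseval(parse, gold)
```

Here two of the three proposed constituents are correct (precision
2/3), two of the four treebank constituents are found (recall 1/2), and
the span (1, 4) crosses the treebank span (0, 2), giving one crossing
bracket.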

The precision and recall measures do not consider 
constituent labels in their evaluation of a parse, since 
the treebank label set will not necessarily coincide 
with the labels used by a given grammar. Since 
SPATTER uses the same syntactic label set as the 
Penn Treebank, it makes sense to report labelled 
precision and labelled recall. These measures are 
computed by considering a constituent to be correct 
if and only if its label matches the label in the treebank.

Table 1 shows the results of SPATTER evaluated against the Penn
Treebank on the Wall Street Journal section 00.



Sent. Length Range         4-40     4-25     10-20
Comparisons                1759     1114     653
Avg. Sent. Length          22.3     16.8     15.6
Treebank Constituents      17.58    13.21    12.10
Parse Constituents         17.48    13.13    12.03
Tagging Accuracy           96.5%    96.6%    96.5%
Crossings Per Sentence     1.33     0.63     0.49
Sent. with 0 Crossings     55.4%    69.8%    73.8%
Sent. with 1 Crossing      69.2%    83.8%    86.8%
Sent. with 2 Crossings     80.2%    92.1%    95.1%
Precision                  86.3%    89.8%    90.8%
Recall                     85.8%    89.3%    90.3%
Labelled Precision         84.5%    88.1%    89.0%
Labelled Recall            84.0%    87.6%    88.5%

Table 1: Results from the WSJ Penn Treebank experiments.

Figures 5, 6, and 7 illustrate the performance of SPATTER as a function
of sentence length. SPATTER's performance degrades slowly for sentences
up to around 28 words, and becomes poorer and more erratic as sentences
get longer. Figure 4 indicates the frequency of each sentence length in
the test corpus.




Figure 4: Frequency in the test corpus as a function of sentence length
for Wall Street Journal experiments.

Figure 5: Number of crossings per sentence as a function of sentence
length for Wall Street Journal experiments.



Figure 6: Percentage of sentences with 0, 1, and 2 crossings as a
function of sentence length for Wall Street Journal experiments.

5 Conclusion

Regardless of what techniques are used for parsing disambiguation, one
thing is clear: if a particular piece of information is necessary for
solving a disambiguation problem, it must be made available to the
disambiguation mechanism. The words in the sentence are clearly
necessary to make parsing decisions, and in some cases long-distance
structural information is also needed. Statistical models for parsing
need to consider many more features of a sentence than can be managed
by n-gram modeling techniques and many more examples than a human can
keep track of. The SPATTER parser illustrates how large amounts of
contextual information can be incorporated into a statistical model for
parsing by applying decision-tree learning algorithms to a large
annotated corpus.

References 

[Bahl et al.1989] L. R. Bahl, P. F. Brown, P. V. deSouza, and R. L.
Mercer. 1989. A tree-based statistical language model for natural
language speech recognition. IEEE Transactions on Acoustics, Speech,
and Signal Processing, Vol. 36, No. 7, pages 1001-1008.

[Baum1972] L. E. Baum. 1972. An inequality and associated maximization
technique in statistical estimation of probabilistic functions of
Markov processes. Inequalities, Vol. 3, pages 1-8.

[Black et al.1991] E. Black et al. 1991. A procedure for quantitatively
comparing the syntactic coverage of English grammars. Proceedings of
the February 1991 DARPA Speech and Natural Language Workshop, pages
306-311.

[Black et al.1993] E. Black, R. Garside, and G. Leech. 1993.
Statistically-driven computer grammars of English: the IBM/Lancaster
approach. Rodopi, Atlanta, Georgia.

[Breiman et al.1984] L. Breiman, J. H. Friedman, R. A. Olshen, and
C. J. Stone. 1984. Classification and Regression Trees. Wadsworth and
Brooks, Pacific Grove, California.

[Brown et al.1992] P. F. Brown, V. Della Pietra, P. V. deSouza, J. C.
Lai, and R. L. Mercer. 1992. Class-based n-gram models of natural
language. Computational Linguistics, 18(4), pages 467-479.

[Magerman1994] D. M. Magerman. 1994. Natural Language Parsing as
Statistical Pattern Recognition. Doctoral dissertation. Stanford
University, Stanford, California.



Figure 7: Precision and recall as a function of sen- 
tence length for Wall Street Journal experiments. 



